On Tail Index Estimation based on Multivariate Data
This article is devoted to the study of tail index estimation based on i.i.d.
multivariate observations drawn from a standard heavy-tailed distribution,
i.e. one whose 1-d Pareto-like marginals share the same tail index. A
multivariate Central Limit Theorem for a random vector, whose components
correspond to (possibly dependent) Hill estimators of the common shape index
alpha, is established under mild conditions. Motivated by the statistical
analysis of extremal spatial data in particular, we introduce the concept of
(standard) heavy-tailed random field of tail index alpha and show how this
limit result can be used to build an estimator of alpha with small
asymptotic mean squared error, through a proper convex linear combination of
the coordinates. Beyond these asymptotic results, simulation experiments
illustrating the relevance of the proposed approach are also presented.
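As a concrete illustration of the univariate building block, here is a minimal Hill-estimator sketch in Python. This is not the paper's multivariate procedure; the function name, the choice k = 2000 and the simulated Pareto sample are assumptions made only for the example.

```python
import numpy as np

def hill_estimator(sample, k):
    """Hill estimator of the tail index alpha from the k largest order statistics."""
    x = np.sort(np.asarray(sample, dtype=float))[::-1]   # descending order
    log_spacings = np.log(x[:k]) - np.log(x[k])          # log-excesses over the (k+1)-th largest
    gamma_hat = log_spacings.mean()                      # estimates 1/alpha
    return 1.0 / gamma_hat

# Simulated check on an exact Pareto(alpha = 2) sample (inverse-transform sampling)
rng = np.random.default_rng(0)
alpha = 2.0
sample = rng.uniform(size=100_000) ** (-1.0 / alpha)     # Pareto(alpha) on [1, infinity)
print(hill_estimator(sample, k=2_000))                   # close to 2.0
```

In the multivariate setting studied here, one such estimator per marginal would then be combined through a convex linear combination, with weights chosen to minimize the asymptotic mean squared error.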
Learning Reputation in an Authorship Network
The problem of searching for experts in a given academic field is hugely
important in both industry and academia. We study exactly this issue with
respect to a database of authors and their publications. The idea is to use
Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) to perform
topic modelling in order to find authors who have worked in a query field. We
then construct a coauthorship graph and motivate the use of influence
maximisation and a variety of graph centrality measures to obtain a ranked list
of experts. The ranked lists are further improved using a Markov Chain-based
rank aggregation approach. The complete method is readily scalable to large
datasets. To demonstrate the efficacy of the approach we report on an extensive
set of computational simulations using the Arnetminer dataset. An improvement
in mean average precision is demonstrated over the baseline case of simply
using the order of authors found by the topic models.
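As a toy illustration of the graph-centrality step only (not Arnetminer-scale, and omitting the LSI/LDA topic filtering, influence maximisation and Markov Chain rank aggregation stages described above), a coauthorship graph can be ranked by weighted degree centrality; the mini-database of papers below is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def coauthorship_degree_ranking(papers):
    """Rank authors by weighted degree centrality in the coauthorship graph
    (each paper contributes one unit of weight to every pair of its authors)."""
    degree = defaultdict(int)
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            degree[a] += 1
            degree[b] += 1
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical mini-database: each paper is just its author list
papers = [
    ["Alice", "Bob"],
    ["Alice", "Carol"],
    ["Alice", "Bob", "Dave"],
]
print(coauthorship_degree_ranking(papers))  # → [('Alice', 4), ('Bob', 3), ('Dave', 2), ('Carol', 1)]
```

Other centrality measures (PageRank, betweenness) would slot into the same pipeline in place of the degree count.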
Functional Bipartite Ranking: a Wavelet-Based Filtering Approach
It is the main goal of this article to address the bipartite ranking issue
from the perspective of functional data analysis (FDA). Given a training set of
independent realizations of a (possibly sampled) second-order random function
with a (locally) smooth autocorrelation structure and to which a binary label
is randomly assigned, the objective is to learn a scoring function s with
optimal ROC curve. Based on linear/nonlinear wavelet-based approximations, it
is shown how to select compact finite-dimensional representations of the input
curves adaptively, in order to build accurate ranking rules, using recent
advances in the ranking problem for multivariate data with binary feedback.
Beyond theoretical considerations, the performance of the learning methods for
functional bipartite ranking proposed in this paper is illustrated by
numerical experiments.
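To make the wavelet filtering step concrete, here is a toy Haar-based nonlinear approximation in plain Python. This is a sketch under assumptions: the paper's adaptive linear/nonlinear schemes and choice of wavelet basis are not specified here, and the Haar basis is used only because it is the simplest to code.

```python
import math

def haar_transform(signal):
    """Full orthonormal Haar decomposition of a length-2^J signal (Mallat ordering)."""
    out = list(signal)
    n = len(out)
    while n > 1:
        half = n // 2
        level = out[:n]
        for i in range(half):
            a, b = level[2 * i], level[2 * i + 1]
            out[i] = (a + b) / math.sqrt(2.0)         # approximation coefficient
            out[half + i] = (a - b) / math.sqrt(2.0)  # detail coefficient
        n = half
    return out

def nonlinear_approx(signal, m):
    """Nonlinear approximation: keep only the m largest-magnitude wavelet coefficients."""
    coeffs = haar_transform(signal)
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:m])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

# A sampled curve compressed to m coefficients, ready to feed a multivariate ranker
print(nonlinear_approx([1.0, 2.0, 3.0, 4.0], m=2))
```

The m retained coefficients form the compact finite-dimensional representation on which a multivariate ranking rule with binary feedback can then be trained.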
Ranking the best instances
We formulate the local ranking problem in the framework of bipartite ranking
where the goal is to focus on the best instances. We propose a methodology
based on the construction of real-valued scoring functions. We study empirical
risk minimization of dedicated statistics which involve empirical quantiles of
the scores. We first state the problem of finding the best instances which can
be cast as a classification problem with mass constraint. Next, we develop
special performance measures for the local ranking problem which extend the
Area Under an ROC Curve (AUC/AROC) criterion and describe the optimal elements
of these new criteria. We also highlight the fact that the goal of ranking the
best instances cannot be achieved in a stage-wise manner where first, the best
instances would be tentatively identified and then a standard AUC criterion
could be applied. Eventually, we state preliminary statistical results for the
local ranking problem.
Comment: 29 pages.
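The classification-with-mass-constraint formulation can be sketched as follows: flag the observations whose score exceeds the empirical (1 - u)-quantile, so that roughly a fraction u of the sample is retained. This is a minimal illustration of the constraint, not the paper's empirical risk minimization procedure.

```python
def top_instances(scores, u):
    """Classification with mass constraint: flag instances whose score exceeds
    the empirical (1 - u)-quantile, i.e. roughly a fraction u of the sample."""
    n = len(scores)
    k = max(1, int(u * n))             # number of instances to retain
    threshold = sorted(scores)[n - k]  # empirical (1 - u)-quantile of the scores
    return [s >= threshold for s in scores]

print(top_instances([0.1, 0.9, 0.4, 0.8, 0.2], u=0.4))  # → [False, True, False, True, False]
```

The local ranking criteria studied in the paper then assess how well the scoring function orders instances above this data-dependent threshold.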
Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics
In a wide range of statistical learning problems such as ranking, clustering
or metric learning among others, the risk is accurately estimated by
U-statistics of degree d >= 1, i.e. functionals of the training data with
low variance that take the form of averages over d-tuples. From a
computational perspective, the calculation of such statistics is highly
expensive even for a moderate sample size n, as it requires averaging
O(n^d) terms. This makes learning procedures relying on the optimization of
such data functionals hardly feasible in practice. It is the major goal of this
paper to show that, strikingly, such empirical risks can be replaced by
drastically computationally simpler Monte-Carlo estimates based on O(n) terms
only, usually referred to as incomplete U-statistics, without damaging the
learning rate of Empirical Risk Minimization (ERM)
procedures. For this purpose, we establish uniform deviation results describing
the error made when approximating a U-process by its incomplete version under
appropriate complexity assumptions. Extensions to model selection, fast rate
situations and various sampling techniques are also considered, as well as an
application to stochastic gradient descent for ERM. Finally, numerical examples
are displayed in order to provide strong empirical evidence that the approach
we promote largely surpasses more naive subsampling techniques.
Comment: To appear in Journal of Machine Learning Research. 34 pages. v2:
minor correction to Theorem 4 and its proof, added 1 reference. v3: typo
corrected in Proposition 3. v4: improved presentation, added experiments on
model selection for clustering, fixed minor typos.
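The complete-versus-incomplete trade-off can be sketched for a degree-2 statistic as follows. This is a hedged toy example, not the paper's ERM procedure; the kernel h, the Gaussian data and the budget B = 20,000 are assumptions for the illustration.

```python
import itertools
import random

def complete_u_statistic(data, h):
    """Complete degree-2 U-statistic: average of the kernel h over all pairs, O(n^2) terms."""
    pairs = list(itertools.combinations(data, 2))
    return sum(h(x, y) for x, y in pairs) / len(pairs)

def incomplete_u_statistic(data, h, B, rng):
    """Incomplete U-statistic: Monte-Carlo average of h over only B randomly drawn pairs."""
    n = len(data)
    total = 0.0
    for _ in range(B):
        i, j = rng.sample(range(n), 2)   # one pair, drawn uniformly at random
        total += h(data[i], data[j])
    return total / B

h = lambda x, y: abs(x - y)              # e.g. the Gini mean-difference kernel
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
full = complete_u_statistic(data, h)     # averages all 124,750 pairs
approx = incomplete_u_statistic(data, h, B=20_000, rng=rng)  # 20,000 terms only
print(full, approx)
```

For n = 500 the incomplete estimate already uses roughly six times fewer terms while staying close to the complete statistic; the gap widens dramatically as n (or the degree d) grows, which is what makes incomplete U-statistics attractive inside ERM loops.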
On Anomaly Ranking and Excess-Mass Curves
Learning how to rank multivariate unlabeled observations depending on their
degree of abnormality/novelty is a crucial problem in a wide range of
applications. In practice, it generally consists in building a real-valued
"scoring" function on the feature space so as to quantify to what extent
observations should be considered abnormal. In the 1-d situation,
measurements are generally considered as "abnormal" when they are remote from
central measures such as the mean or the median. Anomaly detection then relies
on tail analysis of the variable of interest. Extensions to the multivariate
setting are far from straightforward and it is precisely the main purpose of
this paper to introduce a novel and convenient (functional) criterion for
measuring the performance of a scoring function regarding the anomaly ranking
task, referred to as the Excess-Mass curve (EM curve). In addition, an adaptive
algorithm for building a scoring function based on unlabeled data X1, ..., Xn
with a nearly optimal EM curve is proposed and analyzed from a statistical
perspective.
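A minimal sketch of the empirical EM criterion for a fixed scoring function is given below. This is an illustration under assumptions, not the adaptive algorithm of the paper: the toy 1-d scoring function s(x) = -|x| and its closed-form level-set volumes are chosen purely for the example.

```python
import random

def excess_mass_curve(scores, volumes, t_grid):
    """Empirical Excess-Mass curve of a scoring function s:
    EM_s(t) = max over thresholds u of  P_n(s(X) >= u) - t * Leb({s >= u}).

    scores  : observed values s(X_1), ..., s(X_n), used as candidate thresholds u
    volumes : Lebesgue measure of the level set {x : s(x) >= u} for each u in scores
    """
    n = len(scores)
    em = []
    for t in t_grid:
        best = 0.0                                    # the empty level set achieves 0
        for u, vol in zip(scores, volumes):
            mass = sum(1 for s in scores if s >= u) / n
            best = max(best, mass - t * vol)
        em.append(best)
    return em

# Toy 1-d example with s(x) = -|x|: the level set {s >= u} = [u, -u] has measure 2|u|
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
scores = [-abs(x) for x in xs]
volumes = [2.0 * abs(u) for u in scores]
em = excess_mass_curve(scores, volumes, t_grid=[0.5, 1.0, 2.0])
print(em)
```

By construction the curve is non-increasing in t and bounded by 1; a scoring function whose EM curve dominates another's is preferable for the anomaly ranking task.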